QTM 447 Lecture 2: Machine Learning Basics

Kevin McAlister

January 16, 2025

Learning Algorithms

What is a machine learning algorithm?

I really like this definition (Mitchell, 1997):

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

  • In other words, a machine learning algorithm learns from data

  • We would hope that the algorithm is able to perform well at a defined task given some data

Learning Algorithms

The task:

What is it that we’re trying to learn?

  • Classification

  • Regression

  • Some twist on these two

I’m going to say that these tasks can be broken down into (roughly) three types that are not mutually exclusive:

  • Prediction

  • Description

  • Inference/Explanation

Learning Algorithms

Inference/Explanation:

Learn how parts of a system affect one another.

Most common: Causal Inference

  • Learn how one (or more) variables in a system (e.g. a collection of features and outcomes) affect one another

In the regression context, loss for inference problems is defined as:

\mathcal L(\hat{\boldsymbol \beta}, \beta)

maybe mean squared error loss:

E[(\hat{\boldsymbol \beta} - \boldsymbol \beta)^2]

Learning Algorithms

Note that explanation centers wholly on the values of coefficients!

  • Truthfully uncover the size of an effect

There is a subset of machine learning that centers on this concept

  • Causal machine learning

    • The “deconfounder” (Wang and Blei, 2019)

    • Causal Forests (Athey et al, 2019)

    • Matrix Completion for Panel Data (Athey, 2021)

    • Synthetic controls (Xu, 2018)

    • Causal Discovery (Friedman et al, 2000)

Interesting. But, not the focus of this class.

Learning Algorithms

Description

Given a complex set of features, mathematically represent those features in an understandable way

Most common: Dimensionality reduction and density estimation

Most frequently unsupervised methods (the experience)

  • We only see (or use) a set of features

  • There is no meaningful outcome to be trained on

  • No supervision and no real loss function(-ish)

Learning Algorithms

Dimensionality Reduction

We have \mathbf X, a N \times P feature matrix, where P is quite large. We believe that we can meaningfully represent \mathbf X as \boldsymbol \Omega, a N \times M matrix where M \ll P.

  • Think PCA
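
A minimal sketch of this idea (with a made-up low-rank data set, not one from the course): PCA compresses an N \times P feature matrix into an N \times M score matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical N x P feature matrix that mostly varies along M directions
N, P, M = 200, 10, 2
Z = rng.normal(size=(N, M))                    # latent low-dimensional signal
X = Z @ rng.normal(size=(M, P)) + 0.1 * rng.normal(size=(N, P))

# Omega is the N x M representation (the PCA scores)
Omega = PCA(n_components=M).fit_transform(X)
```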

Density Estimation

We have \mathbf X, a N \times P feature matrix. We would like to use these examples to make a guess as to the true P dimensional probability distribution of the features - learn f(\mathbf x).

  • Think fitting a multivariate normal distribution to a collection of N samples like in QDA.
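
A small density-estimation sketch, assuming hypothetical Gaussian data: plug the sample mean and covariance into a multivariate normal, just like QDA does within each class.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical sample: N draws from an unknown 3-dimensional distribution
N = 1000
X = rng.multivariate_normal(mean=[0.0, 1.0, -1.0],
                            cov=np.eye(3) + 0.5, size=N)

# Plug-in estimate of f(x): a multivariate normal with the sample moments
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)
f_hat = multivariate_normal(mean=mu_hat, cov=Sigma_hat)

density_at_origin = f_hat.pdf(np.zeros(3))    # evaluate the estimated density
```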

Learning Algorithms

Recall from last class that we have many tasks where we would like to learn f(\mathbf x) - the marginal distribution of the features

  • Not trying to use them to create predictions of an outcome, directly

These are largely descriptive tasks!

It’s not just PCA and matrix factorization methods.

  • There’s a lot of neat stuff here that we’ll leverage for generation methods

Learning Algorithms

Example: Word Embeddings

Let’s suppose that we want to perform some sort of analysis on a collection of texts

Problem: The set of “word” features can be quite high dimensional

Method: GloVe

  • Converts a collection of tokens from documents to a lower dimensional space

  • Each token is represented in this lower dimensional space and is organized to place similar words close to one another and words that aren’t contextually similar far apart

  • For details, see PML 20.5.2

Learning Algorithms

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Convert the GloVe file to word2vec format
glove_input_file = 'glove.6B.100d.txt'  # replace with your GloVe file path
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

# Load the model
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Define a list of words
words = ['king', 'queen', 'man', 'woman']

# Get the embeddings for the words
embeddings = [model[word] for word in words]

# Use PCA to reduce the dimensionality to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot the 2D embeddings
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])

# Add labels to the points
for i, word in enumerate(words):
    plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.show()

Learning Algorithms

Description is a problem of learning!

  • The more examples we see, the better we are able to describe what we see.

It just requires its own language compared to predictive models

  • We’ll introduce this language as it arises!

Learning Algorithms

The most common type of task is predictive

Given some training data, \{\mathbf x_i , y_i\}_{i =1}^N, learn a function that maps the training features to the training outcomes in such a way that it will do a good job for an instance not included in the training data, \{\mathbf x_0 , y_0\}.

  • The class of tasks you are probably most comfortable with!

The goal of a predictive model is to learn from a set of training instances in a way that tells something about the true data generating process

  • Separates signal from noise.

Learning Algorithms

When we think predictive modeling, we are most often thinking about supervised learning

  • The features are associated with a true outcome that could be measured

  • We can compare the value of our prediction to the truth to judge how well our choice of model has worked

Most often, we’re going to assume that the experience the algorithm receives is a single training set with all instances labeled

  • Each observation is associated with a true outcome

Learning Algorithms

This need not always be the case:

  • Semi-supervised learning: Some instances are labeled while others are not (large scale image classification)

  • Multi-instance learning: Learn if the training set contains at least one instance of the desired outcome without creating a prediction for each instance (Image of a beach - some sand instances, some water instances)

  • Reinforcement learning: A system constantly labels its experiences for the sake of learning to perform a task (AI teammates in DOTA)

Learning Algorithms

We say that a learning algorithm is a good one if it performs well

  • It does a good job of predicting the outcome for an instance that was not included in the training data

This is not well-defined without a mathematical definition of performance

  • With respect to what?

Predictive Performance

The generic predictive problem:

y = f(\mathbf x) + \epsilon

where f(\mathbf x) is the signal that can be explained with the feature set and \epsilon is a zero-centered random noise term.

Goal: Use the training data to learn about the signal function and ignore the noise

  • Why ignore the noise?

Predictive Performance

The problem is that the outcome we see isn’t conveniently separated into signal and noise

  • What we get is a mixture of the two

This is what we need to learn!

  • Some summary of the distribution over the possible outcomes!

Predictive Performance

We need a metric to define success.

The best way to do this is to leverage a proper loss function

  • A function that equals zero when its two arguments are equal

  • The further the two arguments are from one another, the higher the loss

We use a learning method to create a guess for the data generating process - \hat{f}(\mathbf x)

And compare our guess to the truth that we observe - y = f(\mathbf x) + \epsilon

Predictive Performance

Most common loss functions:

For continuous outcomes, we most frequently use mean squared error loss. Given N examples that we would like to assess

\frac{1}{N} \sum \limits_{i = 1}^N (y_i - \hat{f}(\mathbf x_i))^2

Predictive Performance

For discrete outcome sets, we can expect to potentially get our guess exactly correct. Here, we use 0/1 loss

\frac{1}{N} \sum \limits_{i = 1}^N I(\hat{f}(\mathbf x_i) \neq y_i)

  • I(\cdot) denotes the indicator function that equals 1 if the argument is true and 0 otherwise.
  • Why doesn’t 0/1 loss work particularly well for continuous outcomes?
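
Both losses are one-liners in code; a small sketch with made-up outcomes and predictions:

```python
import numpy as np

# Mean squared error with made-up continuous outcomes and predictions
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 3.0])
mse = np.mean((y - y_hat) ** 2)

# 0/1 loss with made-up class labels: the misclassification rate
labels = np.array([0, 1, 1, 0])
preds = np.array([0, 1, 0, 0])
zero_one = np.mean(preds != labels)
```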

Predictive Performance

For 0/1 loss, it is often more convenient to leverage the law of iterated expectations to put things in probability form. With K categories:

\frac{1}{N} \sum \limits_{i = 1}^N I(\hat{f}(\mathbf x_i) \neq y_i) = \frac{1}{N} \sum \limits_{i = 1}^N \sum \limits_{k = 1}^K I(k \neq y_i) Pr(\hat{f}(\mathbf x_i) = k)

= \frac{1}{N} \sum \limits_{i = 1}^N \left(1 - Pr(\hat{f}(\mathbf x_i) = y_i)\right)

or the probability that we make the incorrect choice for each instance.

Predictive Performance

For reasons we’ll soon see, this is very closely related to another classification loss metric called cross-entropy loss:

- \frac{1}{N} \sum \limits_{i = 1}^N \sum \limits_{k = 1}^K I(y_i = k) \log Pr(\hat{f}(\mathbf x_i) = k)

which is just a reworking of the probability loss function from the previous slide.
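
As a sketch with made-up class probabilities (these numbers are illustrative, not from the lecture):

```python
import numpy as np

# Made-up predicted class probabilities (rows: instances, columns: K classes)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])                      # true class indices

# Cross-entropy: average negative log-probability assigned to the true class
ce = -np.mean(np.log(probs[np.arange(len(y)), y]))
```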

Predictive Performance

Given a ground truth set of outcomes, it makes sense to:

  • Choose a loss function for our task

  • Choose the function that minimizes this loss

What we are given in a supervised learning problem:

  • A set of training features, \mathbf X

  • And the corresponding set of training labels, \mathbf y

The goal:

Uncover a function that does a good job of creating predictions for data that was not seen in the training set

Predictive Performance

Let’s start with a comfortable method: Decision Trees

As a review, given a feature set, \mathbf X, and a vector of classes, \mathbf y:

  • Start with the null model

  • Find the split on one feature that minimizes the classification loss on either side of the split; creates two partitions

  • Within each new partition, find the split that minimizes the loss; split again

  • Split until a fixed number of nodes is reached or until the tree puts every observation into its own node
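
These steps can be sketched with sklearn’s DecisionTreeClassifier, which grows a tree by greedy impurity-reducing splits; the 2D data set here is hypothetical, and max_leaf_nodes caps the number of terminal nodes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical 2D training set with a nonlinear class boundary
X = rng.uniform(-3, 3, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)

# max_leaf_nodes caps the number of terminal nodes; splits are chosen
# greedily to minimize impurity within the new partitions
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X, y)
train_error = np.mean(tree.predict(X) != y)   # 0/1 training loss
```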

Predictive Performance

Cross-entropy loss for decision trees requires using an empirical probability measure:

  • Within a node, say that the probability that a prediction within that partition belongs to each class corresponds to the empirical proportion of obs within the node that are in each class

  • Pretty simple!

Predictive Performance

Let’s start by creating a true data generating process.

Predictive Performance

Let’s generate some training data from the true data generating process - 5000 points
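
The lecture’s figures aren’t reproduced here, but the setup can be sketched with a stand-in DGP. This one (uniform features on [-3,3]^2, a smooth logistic class probability) is an assumption for illustration, not the DGP actually used in class.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dgp(n, rng):
    """Hypothetical stand-in DGP: features uniform on [-3, 3]^2 and noisy
    binary labels drawn from a smooth class probability."""
    X = rng.uniform(-3, 3, size=(n, 2))
    p = 1 / (1 + np.exp(-(np.sin(X[:, 0]) + X[:, 1])))  # Pr(y = 1 | x)
    y = (rng.uniform(size=n) < p).astype(int)           # noisy labels
    return X, y

X_train, y_train = sample_dgp(5000, rng)
```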

Predictive Performance

We can split any algorithm into the sets of knowns and unknowns.

For binary decision tree classifiers:

  • Knowns: \mathbf X and \mathbf y, with each y_i \in \{0,1\}

  • Unknowns: Number of terminal nodes in the tree

We need to pick the number of terminal nodes!

  • Let’s do this to create the most generalizable model.

Predictive Performance

According to the training loss, we should fit a tree with 500 terminal nodes!

  • Any thoughts as to why this is the case?

We know, though, that this isn’t what we’re after!

The predictive goal:

  • For data that was not used to train the model

  • Maximize the generalizability of the model

From the true DGP, we know that it is unlikely that it is possible to get these predictions 100% correct everywhere.

Predictive Performance

A reasonable metric of generalizability is the average loss across the entire space of possible feature combinations.

  • x \in ([-3,3] \times [-3,3])

Predictive Performance

Specifically, let’s assume that each pair \{\mathbf x_i, y_i\} is an i.i.d. draw from a common joint distribution, \mathcal T

This then equates to:

E_{\mathcal T}[\mathcal L(\hat{f}(\mathbf x) , y)]

A measure of generalization error

  • If I was able to take another draw from \mathcal T, without knowing what the actual feature/outcome pair is, what is the probability that I make the incorrect choice?

  • Note that expectation of an indicator function is just probability of the event (think about it…)

Predictive Performance

Here, we’ll use a large set of data drawn from the same DGP to evaluate this expectation

  • Evaluate the class predicted by the trained decision tree classifier at 100,000 new points drawn from the same DGP
  • The law of large numbers says this average will get close to the expectation
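
With a stand-in DGP (hypothetical, not the lecture’s actual one), this evaluation looks like the following: a large fresh draw approximates the expectation over \mathcal T for each tree size.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

def sample_dgp(n, rng):
    # hypothetical stand-in DGP: uniform features, noisy binary labels
    X = rng.uniform(-3, 3, size=(n, 2))
    p = 1 / (1 + np.exp(-(np.sin(X[:, 0]) + X[:, 1])))
    y = (rng.uniform(size=n) < p).astype(int)
    return X, y

X_train, y_train = sample_dgp(5000, rng)
X_test, y_test = sample_dgp(100_000, rng)   # large draw approximates E_T

gen_error = {}
for nodes in [2, 10, 75, 500]:
    tree = DecisionTreeClassifier(max_leaf_nodes=nodes, random_state=0)
    tree.fit(X_train, y_train)
    gen_error[nodes] = np.mean(tree.predict(X_test) != y_test)  # 0/1 loss
```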

Predictive Performance

In evaluating this expectation, the 500 node model is no longer the best!

The 75 node model is the best

And all of the lower node models are better than the 500 node model!

Predictive Performance

What’s going on here?

  • Why is 500 not the winner in terms of generalization?

  • Why is 2 not the winner?

Predictive Performance

We know that we’re overfitting with 500 nodes and underfitting with 2 nodes

  • What does overfitting mean?

  • What does underfitting mean?

Predictive Performance

This is all review.

Over the rest of this class and the next, I want to get real formal about our definitions of overfitting and underfitting.

  • How these terms relate to specific guarantees about training error and generalization error

  • How these terms relate to the size of the training data and the flexibility of the chosen method

Predictive Performance

A motivating example:

The previous exercise was done with N = 5000. What if we replicate the analysis for different training set sizes?

  • 500 training samples

  • 5000 training samples

  • 50000 training samples

What do you think is going to happen to the optimal number of terminal nodes in terms of generalization error?

Predictive Performance

With 500 samples, the best model in terms of generalization is very simple with just a few nodes

  • Just better to not try to really pick up complexity; just get some generic boxes and go with that

With 5000 samples, 75 nodes and a little shape

  • Still better to not try to account for too many curves

With 50,000 samples, a lot of nodes and a lot of complexity is optimal!

There appears to be a relationship between the sample size and how flexible the optimal model can be.

Generalization Error

To further explore this relationship, we’ll need to get technical about some definitions and assumptions.

Big Assumption #1

Let’s assume that we draw a training set of size N. Each \{\mathbf x_i , y_i\} is an independent and identically distributed draw from \mathcal T

  • \mathcal T is the true joint distribution of our features and the outcome dictated by the giant random number generator in the sky

  • Dictates all of the possible combinations that we could see and the probability with which we expect to see each one

  • A giant multivariate distribution, f(\mathbf x ,y)

Generalization Error

This i.i.d. assumption makes a lot of sense

  • If we want to use the training sample to learn about the way in which the outcome relates to the features, we need to assume that the training sample is representative of the true distribution

  • Can’t say anything about the general case if we don’t believe that our training sample represents the general case…

Generalization Error

In the supervised learning problem, we want to learn f(y | \mathbf x) - the conditional distribution of the outcome conditional on the features we observe

By assumption:

y = f(\mathbf x) + \epsilon

  • Some fixed function of the features plus a noise term

Under this construction, it can sometimes be useful to think about each y as a random variable, since \mathbf x is assumed given

  • The y we observe is the fixed function of \mathbf x plus a random noise draw

E[f(\mathbf x) + \epsilon] = f(\mathbf x) \text{ ; } V[f(\mathbf x) + \epsilon] = \sigma^2

Generalization Error

If we assume that the noise is additive, random, and has zero mean, then:

E[y \mid \mathbf x] = E[f(\mathbf x) + \epsilon \mid \mathbf x] = f(\mathbf x)

  • The noise is variation in the outcome that is not explained by \mathbf x

  • The expectation of a conditional distribution is only a function of the givens.

Generalization Error

We use a learning method (KNN, linear regression, logistic regression, etc.) to create an approximation to f(\mathbf x) given the training data, \{\mathbf X_D , \mathbf y_D \}.

Let’s call this function \hat{f}(\mathbf x).

  • The prediction is generated from this function

For a given input, we’ll denote the prediction as \hat{y} = \hat{f}(\mathbf x)

Generalization Error

Using \hat{f}(\mathbf x), we can assess the training error

  • How well does the chosen model fit the data that was used to train it?

Expectations are nice to work with, so we’ll define the training error as:

E_D[\mathcal L(\mathbf y_D , \hat{\mathbf y}_D)]

  • \mathcal L(\cdot) is a proper loss function

  • This is the expected value over the data we observed in our observed training set

  • Since this is fixed and observed, we can substitute this for the empirical average

E_D[\mathcal L(\mathbf y_D , \hat{\mathbf y}_D)] = \frac{1}{N} \sum \limits_{i = 1}^N \mathcal L(y_i , \hat{y}_i)

Generalization Error

But, this is not our true loss function to minimize!

We want to understand how well our function fits any data that it might encounter.

This is where we need to go back to \mathcal T.

  • \mathcal T tells us any and all possibilities that we could see and the probability with which those different outcomes could be generated.

  • It includes all of the potential feature pairs we could observe.

  • It also includes all of the noise draws that we could observe conditional on the features.

Generalization Error

A good predictive function should work well for any possible feature/outcome combination we could potentially observe

  • e.g. be generally good

A reasonable measure of generalization error, then, is:

E_\mathcal T [ \mathcal L(y , \hat{y})]

  • Weighted for probability of occurrence, what is the average loss we would observe using \hat{f}(\mathbf x) for any possible feature/outcome combination we could observe?

  • Upweight likely occurrences (a man in retail,no trust fund, brown eyes, 5’8”)

  • Downweight unlikely occurrences (a man in finance, trust fund, blue eyes, 6’5”)

Generalization Error

We’re going to define i.i.d. generalization error as:

E_\mathcal T [ \mathcal L(y , \hat{y})]

The expected loss with respect to the true data generating distribution.

The problem: We don’t know what \mathcal T is!

  • If we did, we’d be done here.

  • All we get are the N samples from \mathcal T that we observe in the training set.

Generalization Error

We don’t get to directly compute E_{\mathcal T} [\mathcal L(y , \hat{y})]

But, we do get to see E_D[\mathcal L(\mathbf y_D , \hat{\mathbf y}_D)]

Can we use the training error to inform us about the generalization error?

Or can we directly compute the generalization error?

Generalization Error

Let’s start with a generic truth: the expected generalization error will always be greater than or equal to the expected training error

  • We know that this is true.

Proving this is true is a bit difficult.

But, we can build a proof by making an assumption beyond the i.i.d. assumption

Convenience Assumption - Fixed X

Let’s assume that we have two different i.i.d. data sets pulled from \mathcal T of size N, \{\mathbf X_D , \mathbf y_D\} and \{\mathbf X_D , \mathbf y_D'\}

The training features are the same

But the actual outcomes differ due to different noise draws

y = f(\mathbf x) + \epsilon \text{ ; } y' = f(\mathbf x) + \epsilon'

Generalization Error

When N is large and we generate two different data sets from \mathcal T, this assumption isn’t that bad!

  • Large N with random draws means good coverage of \mathcal T.

  • This result won’t hold exactly with small N

  • Such is life

\mathbf y is used to generate the predictions, \hat{f}(\mathbf x) = \hat{y}

  • The only difference is how close the predictions are to the different outcomes

Another note:

As N gets large

E[L(y'_D , \hat{y}_D)] \rightarrow E_\mathcal T[ L(y , \hat{y})]

Generalization Error

The proof for MSE (and you can show that this holds for 0/1 loss just by plugging in 0s and 1s):

Test Error

Compute the MSE test error as:

\frac{1}{N} \sum \limits_{i = 1}^N E_{y'}[(y_i' - \hat{f}(\mathbf x_i))^2]

The expectation is taken with respect to y' because this is the separate random component for the test data

Training Error

Compute the MSE training error as:

\frac{1}{N} \sum \limits_{i = 1}^N (y_i - \hat{f}(\mathbf x_i))^2

Generalization Error

Under this construction, treat the outcome as the only random component.

Goal: What is the gap between the training error and the test error?

\omega = E_y\left[\underset{\text{Test Error}}{\frac{1}{N} \sum \limits_{i = 1}^N E_{y'}[(y_i' - \hat{f}(\mathbf x_i))^2]} - \underset{\text{Training Error}}{\frac{1}{N} \sum \limits_{i = 1}^N (y_i - \hat{f}(\mathbf x_i))^2} \right]

Generalization Error

Expand the squares and use linearity of expectation to get:

\omega = \frac{1}{N} \sum \limits_{i = 1}^N \biggl[E_yE_{y'}[y_i'^2] + E_yE_{y'}[\hat{y}_i^2] - 2E_yE_{y'}[y_i' \hat{y}_i] - E_y[y_i^2] - E_y[\hat{y}_i^2] + 2E_y[y_i \hat{y}_i]\biggr]

The distribution (and moments) over y and y' are the same under this construction. Additionally, the predictions don’t have anything to do with y'. So we can further reduce to:

\omega = \left[\frac{1}{N} \sum \limits_{i = 1}^N E_y[y_i^2] + E_y[\hat{y}_i^2] - 2E_y[y_i] E_y[\hat{y}_i] - E_y[y_i^2] - E_y[\hat{y}_i^2] + 2E_y[y_i \hat{y}_i]\right ]

Generalization Error

Cancelling terms and pulling out a 2:

\omega = \left[\frac{2}{N} \sum \limits_{i = 1}^N E_y[y_i \hat{y}_i] - E_y[y_i] E_y[\hat{y}_i]\right ]

Recall that the covariance of two random variables can be written as:

Cov(a,b) = E[ab] - E[a]E[b]

Generalization Error

So, we’re left with:

\omega = \frac{2}{N} \sum \limits_{i = 1}^N Cov(y_i , \hat{y}_i)

Generalization Error

This means that:

\text{Generalization Error} = \text{Training Error} + \frac{2}{N} \sum \limits_{i = 1}^N Cov(y_i , \hat{y}_i)

where the covariance term measures how the training outcomes co-vary with the predictions made at each \mathbf x_i from the training data.

  • For any good predictive model, we expect that these two things are positively correlated.

  • The higher the covariance of the predictions and the training inputs, the larger the gap between generalization error and training error
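
This identity can be checked by simulation for OLS under the fixed-X setup (a hypothetical linear DGP; the covariances are estimated over repeated noise draws):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design, hypothetical linear signal, repeated noise draws
N, sigma = 40, 2.0
X = np.column_stack([np.ones(N), rng.normal(size=N)])
f = X @ np.array([1.0, 2.0])
H = X @ np.linalg.inv(X.T @ X) @ X.T          # OLS predictions: yhat = H y

train, test, ys, yhats = [], [], [], []
for _ in range(5000):
    y = f + sigma * rng.normal(size=N)         # training outcomes
    y2 = f + sigma * rng.normal(size=N)        # fresh outcomes, same X
    yhat = H @ y
    train.append(np.mean((y - yhat) ** 2))
    test.append(np.mean((y2 - yhat) ** 2))
    ys.append(y)
    yhats.append(yhat)
ys, yhats = np.array(ys), np.array(yhats)

# Gap between generalization error and training error...
gap = np.mean(test) - np.mean(train)
# ...matches (2/N) * sum of Cov(y_i, yhat_i)
cov_term = (2 / N) * sum(np.cov(ys[:, i], yhats[:, i])[0, 1]
                         for i in range(N))
```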

Generalization Error

A model with low generalization error balances and minimizes:

  • Training error

  • Covariance between the prediction and the training outcome

Training error is minimized when the chosen model is as flexible as possible!

Generalization Error

When might we expect the covariance between a prediction at \mathbf x and the observed outcome at that point to be high?

Note:

For all learning algorithms, the prediction at any point is a function of the training features and outcomes and the associated feature vector.

\hat{y}_0 = g(\mathbf x_0, \mathbf X , \mathbf y)

Let’s explore this with another example.

Generalization Error

Polynomial Regression with 1 feature:

y = \alpha + \sum \limits_{j = 1}^D \beta_j x^j + \epsilon

As the polynomial degree increases, the wiggliness of the function increases.

The feature matrix grows through basis expansion as the degree increases. For a third degree polynomial:

h(\mathbf x) = [1 , x , x^2, x^3]
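
This basis expansion is a one-liner in numpy; the sample values here are made up:

```python
import numpy as np

# Basis expansion for a third-degree polynomial in one feature:
# h(x) = [1, x, x^2, x^3] turns a single column into four columns
x = np.array([-1.0, 0.0, 2.0])
H = np.vander(x, N=4, increasing=True)
```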

Generalization Error

Let’s assume that we estimated the coefficients using ordinary least squares on the training data:

\hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y

So, the prediction at any point \mathbf x_0 becomes:

\hat{y}_0 = \mathbf x_0^T \hat{\boldsymbol \beta} = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y

  • Each prediction depends on the vector of training outcomes!

  • The training features and value at which the prediction is made create weights for the training outcomes

  • The prediction at any point is a weighted combination of the training outcomes.
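
A sketch of this weighting view with made-up data: the same prediction comes out of the coefficient view and the weights-on-outcomes view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data and a new point x0
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=20)
x0 = np.array([1.0, 0.3])

# The weights x0^T (X^T X)^{-1} X^T depend only on the features;
# the prediction is then a weighted combination of the training outcomes
w = x0 @ np.linalg.inv(X.T @ X) @ X.T      # length-20 weight vector
pred_from_weights = w @ y

# Equivalent: compute OLS coefficients first, then predict
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
pred_from_beta = x0 @ beta_hat
```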

Generalization Error

We want to learn the covariance between the training outcomes and the predictions made for the training data

\sum \limits_{i = 1}^N Cov(y_i , \hat{y}_i)

where

\hat{\mathbf y} = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y

So, we’re trying to compute

Cov[\mathbf y , \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y]

  • Under the fixed \mathbf X assumption, each outcome is the only random part of this expression

Generalization Error

Since covariance is a linear operator:

Cov[\mathbf y , \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y] = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T Cov[\mathbf y,\mathbf y] = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \sigma^2 \mathbf I_N

  • An N \times N covariance matrix with the self-covariances along the main diagonal

Generalization Error

Since \sigma^2 is a constant, the sum of the self-covariances is the sum of the diagonal elements of this matrix:

\sum \limits_{i = 1}^N Cov(y_i , \hat{y}_i) = \sigma^2 \text{tr}\left(\mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \right)

where the trace is the sum of the diagonal elements.

Generalization Error

For any linear model, we can compute the sum of the self-covariances as:

\sigma^2 \text{tr}\left(\mathbf X(\mathbf X^T \mathbf X)^{-1} \mathbf X^T\right)

where \text{tr}() is the trace of the matrix (the sum of the diagonal elements)

  • Anyone know what the trace of this matrix is if \mathbf X is a N \times P matrix?
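
The question can be checked numerically before doing the algebra; a quick sketch with a random full-rank design:

```python
import numpy as np

rng = np.random.default_rng(0)

# For an N x P design with full column rank, tr(X (X^T X)^{-1} X^T) = P
N, P = 100, 5
X = rng.normal(size=(N, P))
trace_H = np.trace(X @ np.linalg.inv(X.T @ X) @ X.T)
```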

Generalization Error

The trace is invariant to cyclic permutations:

\sigma^2 \text{tr}\left((\mathbf X^T \mathbf X)^{-1} \mathbf X^T\mathbf X\right)

The sum of the self-covariances is then:

P \sigma^2

and the training error offset term is:

\frac{2P\sigma^2}{N}

Generalization Error

Let’s think about what this means:

  • As the number of implied features increases, the covariance between the training outcomes and the predictions increases

  • Put another way, the more implied features in the model, the more the prediction at \mathbf x depends on the value of the outcome at that point and the less it depends on the value of the outcome at surrounding points

  • For the polynomial model, as the degree increases, there are more terms

  • This makes sense because we’re overresponding to wiggliness in the training data

  • Overfitting

  • This effect is diminished as N gets larger

Generalization Error

Three factors that affect generalization performance

  • Training error

  • The covariance of training outcomes and predictions

  • The sample size

Lower either of the first two (or increase the sample size) and decrease the generalization error!

Generalization Error

Generated from a 3rd degree polynomial with \sigma^2 = 9.

Generalization Error

This is great!

  • We understand a little about the link between training error and the generalization error

But, this result really only applies when we can compute the covariance of the prediction and the outcome

  • What is this covariance for an SVM with an RBF kernel?

  • What is this covariance for a random forest?

To make more headway here, we need to present more general results.

Next Time

The bias/complexity tradeoff!

  • It’s always there (but goes away as N \rightarrow \infty)

  • Deep learning is cool and all, but we need a lot of data to justify the flexibility…